148-29: An Approach to Record-Linkage Using Propensity Score
Abstract
Traditional uses of the propensity score involve bias reduction in matching a treatment group with a non-randomized control group. The propensity score controls for patient demographics and other background covariates while producing a single scalar summary variable. Distance or nearest available matching between the case and the control group is then performed on that single score with comparative ease. The same scoring method, with different matching techniques, is proposed as an approach to record linkage between two or more mock datasets. Each dataset comprises shared and unshared patient records, and the datasets lack a common single patient identifier. Common multiple key identifiers are used as the covariate vector in the SAS/STAT LOGISTIC procedure, which generates the single score representing the propensity of existing in one dataset relative to the other. The propensity score can then be match-merged to link records belonging to the same patient as defined by the independent fields. A simplified illustration is provided, and a discussion of this technique of record linkage is accompanied by considerations in deployment, including weaknesses, strengths and practical implications.

INTRODUCTION

Logistic regression assigns each record a propensity score, which is an estimate of the probability of assignment to a particular group given a vector of observed covariates (Rosenbaum and Rubin, 1984). Popular uses of the propensity score include one-to-one and one-to-many matches that are based on distance-metric methodologies, weighting schemes, or matching on ranges. By controlling for demographic and other characteristics of a patient, two comparison groups are produced, as in a case-control match. An approach for linkage of the same, rather than comparable, subjects is proposed for patients who exist across multiple data sources and for whom no single primary-key identifier exists.
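For a comparison between two groups, the score described above is commonly written in the binary logistic form below. This is offered only as a sketch of the standard formulation; the symbols Z (group indicator), x (covariate vector), and beta (coefficients) are notation introduced here and do not come from the paper.

```latex
e(\mathbf{x}_i) \;=\; \Pr(Z_i = 1 \mid \mathbf{x}_i)
\;=\; \frac{1}{1 + \exp\{-(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_i)\}}
```

Under this form, two records with identical covariate vectors always receive identical scores. The converse holds only when no two distinct covariate vectors produce the same linear predictor.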
For each record, a single score is produced by the SAS/STAT LOGISTIC procedure and is derived from identifiers such as demographics, gender, and the temporal and geographical characteristics of the patient. Although exact matching of the score between data sources is required, the covariates can be selectively fuzzed or imputed prior to running the logistic model. The approach taken is empirical, and therefore any conclusions drawn from these results may be limited to the examples provided.

OBJECTIVES

The goal of this paper is to identify the identical patient, rather than the customary similar patient, in two or possibly more datasets using the propensity score generated by the logistic model. If the propensity score is a unique representation based on the covariate vector of each observation, then an individual match-merge on such a score will yield the same patient as defined by the set of covariates. This is in contrast to the nearest available matching proposed by Rubin (1973).

THE DATASETS

The three mock datasets found in APPENDIX A provide mock Registry, Clinical, and Hospital Discharge data on patients. Individually they serve their own purposes, but when combined they offer synergistic value to research. APPENDIX A contains an example of how the three datasets are matched.

THE COVARIATES AND THE OUTCOME

The covariates that uniquely predict the outcome in the logistic model consist of STATE, HOSPITAL ID, ADMIT DATE, DISCHARGE DATE, AGE, GENDER and PATIENT FIRST, MIDDLE and LAST INITIAL. In this example, the outcome variable named SOURCE is a field coded as 1 for the REGISTRY dataset, 2 for the HSPDISCH dataset and 3 for the CLINICAL dataset. All covariate and outcome fields must be converted into their numeric equivalents.

PREPARING THE DATASETS

To prepare the datasets for logistic regression, the records of the datasets must be appended (stacked), but the variables must first be given standardized names and formats. A system ID is applied to each dataset prior to stacking.
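The preparation step described above (standardized names, numeric conversion, a system ID, then stacking) can be sketched roughly as follows. The paper itself works in SAS; this is a Python illustration under stated assumptions. The source field names, the `letters_to_number` encoding, the single combined `initials` field, and the `SOURCE-n` form of the system ID are all hypothetical choices made for this example. Only the SOURCE coding (1, 2, 3) comes from the text.

```python
from datetime import date

# Outcome coding taken from the text: 1 = REGISTRY, 2 = HSPDISCH, 3 = CLINICAL.
SOURCE_CODES = {"REGISTRY": 1, "HSPDISCH": 2, "CLINICAL": 3}


def letters_to_number(text):
    """Hypothetical encoding: read letters as base-27 digits (A=1 ... Z=26)."""
    value = 0
    for ch in text.upper():
        value = value * 27 + (ord(ch) - ord("A") + 1)
    return value


def to_numeric(rec):
    """Convert one standardized record into numeric covariates."""
    return {
        "state": letters_to_number(rec["state"]),
        "hospital_id": int(rec["hospital_id"]),
        "admit_date": rec["admit_date"].toordinal(),
        "discharge_date": rec["discharge_date"].toordinal(),
        "age": int(rec["age"]),
        "gender": {"F": 0, "M": 1}[rec["gender"]],
        "initials": letters_to_number(rec["initials"]),
    }


def stack(datasets, renames):
    """Rename fields per source, convert to numeric, tag, and append."""
    stacked = []
    for source, records in datasets.items():
        for row_number, raw in enumerate(records, start=1):
            rec = {renames[source].get(k, k): v for k, v in raw.items()}
            row = to_numeric(rec)
            row["source"] = SOURCE_CODES[source]
            row["system_id"] = f"{source}-{row_number}"  # hypothetical ID format
            stacked.append(row)
    return stacked


# Made-up records: the first two describe the same patient under different
# field names and formats; the third is a different patient.
datasets = {
    "REGISTRY": [{"st": "MD", "hospital_id": 101, "admit_date": date(2003, 1, 5),
                  "discharge_date": date(2003, 1, 9), "age": 64,
                  "gender": "F", "initials": "JQP"}],
    "HSPDISCH": [{"state": "MD", "hosp": "101", "admit_date": date(2003, 1, 5),
                  "discharge_date": date(2003, 1, 9), "age": "64",
                  "gender": "F", "initials": "jqp"}],
    "CLINICAL": [{"state": "VA", "hospital_id": 230, "admit_date": date(2003, 2, 1),
                  "discharge_date": date(2003, 2, 3), "age": 47,
                  "gender": "M", "initials": "ABC"}],
}
renames = {"REGISTRY": {"st": "state"}, "HSPDISCH": {"hosp": "hospital_id"}, "CLINICAL": {}}

stacked = stack(datasets, renames)
```

After standardization, the REGISTRY and HSPDISCH rows carry identical numeric covariates and differ only in `source` and `system_id`, which is the precondition for an exact match on any score computed from those covariates.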
In addition to the system ID, ideally only the covariate and outcome fields are carried forward in the initial stack. After successful linkage, the original source data can be merged back into the matched dataset using the system ID. Separating unnecessary fields from the variables used in the model reduces the impact of record inflation on system resources. Other methods that ultimately reduce the impact of record inflation and of

SUGI 29 Posters
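The exact match on the score and the merge-back through the system ID, as described in the text above, might look like the following Python sketch. It is an illustration rather than the paper's SAS implementation. The coefficients are made up instead of estimated by a fitted logistic model, and the field names and sample values are hypothetical.

```python
import math
from collections import defaultdict

# Made-up coefficients for illustration only; in the paper the score comes
# from a logistic model fitted with the SAS/STAT LOGISTIC procedure.
INTERCEPT = -1.2
COEFFS = {"age": 0.013, "gender": -0.4, "hospital_id": 0.002}


def score(row):
    """Logistic score of one slim record under the assumed coefficients."""
    linear = INTERCEPT + sum(COEFFS[name] * row[name] for name in COEFFS)
    return 1.0 / (1.0 + math.exp(-linear))


def link_by_score(stacked):
    """Group records on the exact score; keep groups that span sources."""
    groups = defaultdict(list)
    for row in stacked:
        groups[score(row)].append(row)
    return [rows for rows in groups.values()
            if len({r["source"] for r in rows}) > 1]


def merge_back(linked_groups, originals):
    """Re-attach the original source fields through the system ID."""
    merged = []
    for rows in linked_groups:
        patient = {}
        for r in rows:
            patient.update(originals[r["system_id"]])
        merged.append(patient)
    return merged


# Slim stacked records: system ID, outcome, and covariates only.
stacked = [
    {"system_id": "REGISTRY-1", "source": 1, "age": 64, "gender": 0, "hospital_id": 101},
    {"system_id": "HSPDISCH-1", "source": 2, "age": 64, "gender": 0, "hospital_id": 101},
    {"system_id": "CLINICAL-1", "source": 3, "age": 47, "gender": 1, "hospital_id": 230},
]
# Fields left out of the stack, keyed by system ID (hypothetical content).
originals = {
    "REGISTRY-1": {"registry_stage": "II"},
    "HSPDISCH-1": {"discharge_status": "home"},
    "CLINICAL-1": {"lab_value": 5.1},
}

linked = link_by_score(stacked)
merged = merge_back(linked, originals)
```

Because distinct covariate vectors can in principle yield the same score, a cautious deployment would probably compare the covariates of each linked group before accepting the match.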
Similar resources
Probabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognizing duplicates in a data set. The i...
Assessing Nonresponse Bias and Measurement Error Using Statistical Matching
The estimation of nonresponse bias and measurement error share the problem of usually not having a criterion to assess the quality of the estimate. Nonresponse bias analysis often uses responders within the survey sample who are in some way similar to nonresponders to estimate the potential bias. This depends on the variables within the survey being related to both the likelihood of responding ...
The Effect of Inflation Targeting on Indirect Tax Performance in Selected Countries Using Propensity Score Matching Model
The inflation targeting framework has become a predominant monetary approach across the globe. Williams (2015) believes that, in a very real sense, almost all economies are now inflation targeters, either explicit or implicit. Due to the increasing spread of this policy, it is necessary to consider the way it affects macroeconomic variables. Using prevalent economic models for evaluating the eff...
A method to detect single-nucleotide polymorphisms accounting for a linkage signal using covariate-based affected relative pair linkage analysis
We evaluate an approach to detect single-nucleotide polymorphisms (SNPs) that account for a linkage signal with covariate-based affected relative pair linkage analysis in a conditional-logistic model framework using all 200 replicates of the Genetic Analysis Workshop 17 family data set. We begin by combining the multiple known covariate values into a single variable, a propensity score. We also...
An Impact Estimator Using Propensity Score Matching: People’s Business Credit Program to Micro Entrepreneurs in Indonesia
People’s business credit program (KUR) has been launched to alleviate poverty through the provision of micro financing to micro entrepreneurs in Indonesia. This study aims to estimate the impact of the KUR program using cross-sectional data and the propensity score matching (PSM) technique. The survey was conducted on 332 household entrepreneurs, consisting of 155 KUR receivers and 177 non-KUR r...